FA vs. PCA;
Conducting an Exploratory Factor Analysis Example;
Factor Score;
FA applied in Regression;
1- What is difference between EFA and PCA?
2- Is it possible that we can run the regression analysis on factors
generating from Factor Analysis?
3- If so, how can we run it?
The demonstration is to use the dataset Factor-Hair-Revised.csv to build a regression model to predict satisfaction.
FA | PCA | |
---|---|---|
Meaning | A factor (or latent) is a common or
underlying element with which several other variables are
correlated \(X_1 = L_1\times F + \epsilon_1\) \(X_2 = L_2\times F + \epsilon_2\) \(X_3 = L_3\times F + \epsilon_3\) where, F is the factor, W is are the loading |
A component is a derived new dimension (or
variable) so that the derived variables are linearly independent of each
other \(Y = W_1\times PC_1 + W_2\times PC_2 + ...+ W_{10}\times PC_{10} + C\) where, PCs are the components & W is are the weights |
Process | The original variables are defined as the linear combinations of the factors | The components are calculated as the linear combinations of the original variables |
Objective | FA focuses on explaining the covariances or the correlations between the variables | The aim of PCA is to explain as much of the cumulative variance in the predictors (or variables) as possible |
Purpose | Factor Analysis is used to understand the underlying ‘cause’ which these factors (latent or constituents) capture much of the information of a set of variables in the dataset data | PCA is used to decompose the data into a smaller number of components and therefore is a type of Singular Value Decomposition (SVD) |
Interpretation of coefficients | The coefficients (loadings) in the factor analysis express the relationship or association of each variable (X) to the underlying factor (F) | The coefficients indicate which component contributes more to the target variable, Y, as the independent variables are standardized |
<- read_csv("Factor-Hair-Revised.csv")
prodsurvey head(prodsurvey)
# A tibble: 6 × 13
ID ProdQual Ecom TechSup CompRes Adver…¹ ProdL…² Sales…³ ComPr…⁴
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 8.5 3.9 2.5 5.9 4.8 4.9 6 6.8
2 2 8.2 2.7 5.1 7.2 3.4 7.9 3.1 5.3
3 3 9.2 3.4 5.6 5.6 5.4 7.4 5.8 4.5
4 4 6.4 3.3 7 3.7 4.7 4.7 4.5 8.8
5 5 9 3.4 5.2 4.6 2.2 6 4.5 6.8
6 6 6.5 2.8 3.1 4.1 4 4.3 3.7 8.5
# … with 4 more variables: WartyClaim <dbl>, OrdBilling <dbl>,
# DelSpeed <dbl>, Satisfaction <dbl>, and abbreviated variable
# names ¹Advertising, ²ProdLine, ³SalesFImage, ⁴ComPricing
dim(prodsurvey)
[1] 100 13
str(prodsurvey)
spc_tbl_ [100 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ID : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
$ ProdQual : num [1:100] 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
$ Ecom : num [1:100] 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
$ TechSup : num [1:100] 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
$ CompRes : num [1:100] 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
$ Advertising : num [1:100] 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
$ ProdLine : num [1:100] 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
$ SalesFImage : num [1:100] 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
$ ComPricing : num [1:100] 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
$ WartyClaim : num [1:100] 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
$ OrdBilling : num [1:100] 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
$ DelSpeed : num [1:100] 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
$ Satisfaction: num [1:100] 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
- attr(*, "spec")=
.. cols(
.. ID = col_double(),
.. ProdQual = col_double(),
.. Ecom = col_double(),
.. TechSup = col_double(),
.. CompRes = col_double(),
.. Advertising = col_double(),
.. ProdLine = col_double(),
.. SalesFImage = col_double(),
.. ComPricing = col_double(),
.. WartyClaim = col_double(),
.. OrdBilling = col_double(),
.. DelSpeed = col_double(),
.. Satisfaction = col_double()
.. )
- attr(*, "problems")=<externalptr>
names(prodsurvey)
[1] "ID" "ProdQual" "Ecom" "TechSup"
[5] "CompRes" "Advertising" "ProdLine" "SalesFImage"
[9] "ComPricing" "WartyClaim" "OrdBilling" "DelSpeed"
[13] "Satisfaction"
describe(prodsurvey)
vars n mean sd median trimmed mad min max
ID 1 100 50.50 29.01 50.50 50.50 37.06 1.0 100.0
ProdQual 2 100 7.81 1.40 8.00 7.85 1.78 5.0 10.0
Ecom 3 100 3.67 0.70 3.60 3.63 0.52 2.2 5.7
TechSup 4 100 5.36 1.53 5.40 5.40 1.85 1.3 8.5
CompRes 5 100 5.44 1.21 5.45 5.46 1.26 2.6 7.8
Advertising 6 100 4.01 1.13 4.00 4.00 1.19 1.9 6.5
ProdLine 7 100 5.80 1.32 5.75 5.81 1.56 2.3 8.4
SalesFImage 8 100 5.12 1.07 4.90 5.09 0.89 2.9 8.2
ComPricing 9 100 6.97 1.55 7.10 7.02 1.93 3.7 9.9
WartyClaim 10 100 6.04 0.82 6.10 6.04 0.89 4.1 8.1
OrdBilling 11 100 4.28 0.93 4.40 4.31 0.74 2.0 6.7
DelSpeed 12 100 3.89 0.73 3.90 3.92 0.74 1.6 5.5
Satisfaction 13 100 6.92 1.19 7.05 6.90 1.33 4.7 9.9
range skew kurtosis se
ID 99.0 0.00 -1.24 2.90
ProdQual 5.0 -0.24 -1.17 0.14
Ecom 3.5 0.64 0.57 0.07
TechSup 7.2 -0.20 -0.63 0.15
CompRes 5.2 -0.13 -0.66 0.12
Advertising 4.6 0.04 -0.94 0.11
ProdLine 6.1 -0.09 -0.60 0.13
SalesFImage 5.3 0.37 0.26 0.11
ComPricing 6.2 -0.23 -0.96 0.15
WartyClaim 4.0 0.01 -0.53 0.08
OrdBilling 4.7 -0.32 0.11 0.09
DelSpeed 3.9 -0.45 0.09 0.07
Satisfaction 5.2 0.08 -0.86 0.12
<- cor(prodsurvey[,-1])
datamatrix1 corrplot(datamatrix1, method="number")
As we can see from the above correlation matrix:
CompRes
) and delivery speed
(DelSpeed
) are highly correlatedOrdBilling
) and CompRes
are
highly correlatedWartyClaim
and TechSupport
are highly
correlatedOrdBilling
and DelSpeed
are highly
correlatedEcom
) and sales force image
(SalesFImage
) are highly correlatedpcor(prodsurvey[,-1], method="pearson")
$estimate
ProdQual Ecom TechSup CompRes
ProdQual 1.000000000 0.1549597037 0.002424222 -0.056354142
Ecom 0.154959704 1.0000000000 0.082359000 -0.033513777
TechSup 0.002424222 0.0823590001 1.000000000 0.143603415
CompRes -0.056354142 -0.0335137768 0.143603415 1.000000000
Advertising 0.112376746 -0.0002972504 -0.059300254 -0.064806093
ProdLine 0.281144724 0.1538660545 -0.125349050 0.020305650
SalesFImage -0.376228551 0.7321011890 -0.093310796 -0.022150933
ComPricing -0.014021386 0.0149131857 -0.132972485 -0.004487151
WartyClaim -0.042752416 -0.1232553822 0.787729606 -0.109686211
OrdBilling 0.054523215 0.1546990521 -0.165914724 0.288344382
DelSpeed -0.335084210 -0.0083917930 -0.021914916 0.528932328
Satisfaction 0.607438787 -0.3308440834 0.055111867 0.172416347
Advertising ProdLine SalesFImage ComPricing
ProdQual 0.1123767461 0.28114472 -0.376228551 -0.014021386
Ecom -0.0002972504 0.15386605 0.732101189 0.014913186
TechSup -0.0593002540 -0.12534905 -0.093310796 -0.132972485
CompRes -0.0648060931 0.02030565 -0.022150933 -0.004487151
Advertising 1.0000000000 -0.13192319 0.262879343 -0.063298038
ProdLine -0.1319231866 1.00000000 -0.230170570 -0.361757904
SalesFImage 0.2628793429 -0.23017057 1.000000000 0.126612489
ComPricing -0.0632980381 -0.36175790 0.126612489 1.000000000
WartyClaim 0.0275909587 0.25718360 0.189300271 0.020351554
OrdBilling -0.0326580430 -0.28098068 -0.182359301 -0.086500869
DelSpeed 0.2046544064 0.50189505 0.005889925 0.190437993
Satisfaction -0.0449735411 0.18325156 0.660251850 -0.087467711
WartyClaim OrdBilling DelSpeed Satisfaction
ProdQual -0.04275242 0.05452322 -0.335084210 0.60743879
Ecom -0.12325538 0.15469905 -0.008391793 -0.33084408
TechSup 0.78772961 -0.16591472 -0.021914916 0.05511187
CompRes -0.10968621 0.28834438 0.528932328 0.17241635
Advertising 0.02759096 -0.03265804 0.204654406 -0.04497354
ProdLine 0.25718360 -0.28098068 0.501895051 0.18325156
SalesFImage 0.18930027 -0.18235930 0.005889925 0.66025185
ComPricing 0.02035155 -0.08650087 0.190437993 -0.08746771
WartyClaim 1.00000000 0.25984451 -0.091649002 -0.08868071
OrdBilling 0.25984451 1.00000000 0.350530212 0.14880055
DelSpeed -0.09164900 0.35053021 1.000000000 0.08955614
Satisfaction -0.08868071 0.14880055 0.089556143 1.00000000
$p.value
ProdQual Ecom TechSup CompRes
ProdQual 0.000000e+00 1.447435e-01 9.819081e-01 5.977987e-01
Ecom 1.447435e-01 0.000000e+00 4.402823e-01 7.538375e-01
TechSup 9.819081e-01 4.402823e-01 0.000000e+00 1.769171e-01
CompRes 5.977987e-01 7.538375e-01 1.769171e-01 0.000000e+00
Advertising 2.916350e-01 9.977814e-01 5.787601e-01 5.439511e-01
ProdLine 7.269030e-03 1.476353e-01 2.391187e-01 8.493380e-01
SalesFImage 2.576212e-04 2.442012e-16 3.817042e-01 8.358301e-01
ComPricing 8.956443e-01 8.890480e-01 2.115149e-01 9.665194e-01
WartyClaim 6.890839e-01 2.471182e-01 3.262868e-20 3.034147e-01
OrdBilling 6.097697e-01 1.454288e-01 1.180864e-01 5.850710e-03
DelSpeed 1.245192e-03 9.374303e-01 8.375553e-01 8.359069e-08
Satisfaction 2.182238e-10 1.447603e-03 6.059096e-01 1.041593e-01
Advertising ProdLine SalesFImage ComPricing
ProdQual 0.29163501 7.269030e-03 2.576212e-04 0.8956443272
Ecom 0.99778145 1.476353e-01 2.442012e-16 0.8890479591
TechSup 0.57876015 2.391187e-01 3.817042e-01 0.2115148546
CompRes 0.54395113 8.493380e-01 8.358301e-01 0.9665194173
Advertising 0.00000000 2.151737e-01 1.230672e-02 0.5533828069
ProdLine 0.21517368 0.000000e+00 2.907563e-02 0.0004593443
SalesFImage 0.01230672 2.907563e-02 0.000000e+00 0.2343791374
ComPricing 0.55338281 4.593443e-04 2.343791e-01 0.0000000000
WartyClaim 0.79629803 1.440249e-02 7.394541e-02 0.8490014635
OrdBilling 0.75993047 7.304606e-03 8.538047e-02 0.4175555780
DelSpeed 0.05300070 4.664278e-07 9.560619e-01 0.0721942454
Satisfaction 0.67382405 8.383625e-02 1.449719e-12 0.4123500152
WartyClaim OrdBilling DelSpeed Satisfaction
ProdQual 6.890839e-01 0.6097697424 1.245192e-03 2.182238e-10
Ecom 2.471182e-01 0.1454287628 9.374303e-01 1.447603e-03
TechSup 3.262868e-20 0.1180864174 8.375553e-01 6.059096e-01
CompRes 3.034147e-01 0.0058507099 8.359069e-08 1.041593e-01
Advertising 7.962980e-01 0.7599304715 5.300070e-02 6.738241e-01
ProdLine 1.440249e-02 0.0073046065 4.664278e-07 8.383625e-02
SalesFImage 7.394541e-02 0.0853804743 9.560619e-01 1.449719e-12
ComPricing 8.490015e-01 0.4175555780 7.219425e-02 4.123500e-01
WartyClaim 0.000000e+00 0.0133877361 3.902770e-01 4.058729e-01
OrdBilling 1.338774e-02 0.0000000000 7.064114e-04 1.615975e-01
DelSpeed 3.902770e-01 0.0007064114 0.000000e+00 4.012356e-01
Satisfaction 4.058729e-01 0.1615974575 4.012356e-01 0.000000e+00
$statistic
ProdQual Ecom TechSup CompRes
ProdQual 0.00000000 1.471424516 0.02274128 -0.52949015
Ecom 1.47142452 0.000000000 0.77522957 -0.31456380
TechSup 0.02274128 0.775229571 0.00000000 1.36122814
CompRes -0.52949015 -0.314563798 1.36122814 0.00000000
Advertising 1.06090746 -0.002788456 -0.55726637 -0.60921569
ProdLine 2.74821968 1.460787002 -1.18522655 0.19052316
SalesFImage -3.80921125 10.081854496 -0.87916864 -0.20784517
ComPricing -0.13154519 0.139913642 -1.25856891 -0.04209363
WartyClaim -0.40142023 -1.165122048 11.99562484 -1.03519395
OrdBilling 0.51223505 1.468888754 -1.57829305 2.82489237
DelSpeed -3.33624278 -0.078724768 -0.20562952 5.84663073
Satisfaction 7.17336519 -3.288799958 0.51778207 1.64199901
Advertising ProdLine SalesFImage ComPricing
ProdQual 1.060907458 2.7482197 -3.80921125 -0.13154519
Ecom -0.002788456 1.4607870 10.08185450 0.13991364
TechSup -0.557266374 -1.1852266 -0.87916864 -1.25856891
CompRes -0.609215688 0.1905232 -0.20784517 -0.04209363
Advertising 0.000000000 -1.2484608 2.55592188 -0.59498137
ProdLine -1.248460807 0.0000000 -2.21876450 -3.64012829
SalesFImage 2.555921881 -2.2187645 0.00000000 1.19736653
ComPricing -0.594981366 -3.6401283 1.19736653 0.00000000
WartyClaim 0.258924708 2.4965744 1.80849286 0.19095404
OrdBilling -0.306523103 -2.7464786 -1.73985585 -0.81450302
DelSpeed 1.961341689 5.4434474 0.05525335 1.81976992
Satisfaction -0.422316520 1.7486638 8.24679932 -0.82367672
WartyClaim OrdBilling DelSpeed Satisfaction
ProdQual -0.4014202 0.5122350 -3.33624278 7.1733652
Ecom -1.1651220 1.4688888 -0.07872477 -3.2888000
TechSup 11.9956248 -1.5782930 -0.20562952 0.5177821
CompRes -1.0351939 2.8248924 5.84663073 1.6419990
Advertising 0.2589247 -0.3065231 1.96134169 -0.4223165
ProdLine 2.4965744 -2.7464786 5.44344736 1.7486638
SalesFImage 1.8084929 -1.7398559 0.05525335 8.2467993
ComPricing 0.1909540 -0.8145030 1.81976992 -0.8236767
WartyClaim 0.0000000 2.5242649 -0.86337748 -0.8351894
OrdBilling 2.5242649 0.0000000 3.51103503 1.4115877
DelSpeed -0.8633775 3.5110350 0.00000000 0.8435005
Satisfaction -0.8351894 1.4115877 0.84350046 0.0000000
$n
[1] 100
$gp
[1] 10
$method
[1] "pearson"
<- lm(Satisfaction ~ ., prodsurvey[,-1])
model1 summary(model1)
Call:
lm(formula = Satisfaction ~ ., data = prodsurvey[, -1])
Residuals:
Min 1Q Median 3Q Max
-1.43005 -0.31165 0.07621 0.37190 0.90120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.66961 0.81233 -0.824 0.41199
ProdQual 0.37137 0.05177 7.173 2.18e-10 ***
Ecom -0.44056 0.13396 -3.289 0.00145 **
TechSup 0.03299 0.06372 0.518 0.60591
CompRes 0.16703 0.10173 1.642 0.10416
Advertising -0.02602 0.06161 -0.422 0.67382
ProdLine 0.14034 0.08025 1.749 0.08384 .
SalesFImage 0.80611 0.09775 8.247 1.45e-12 ***
ComPricing -0.03853 0.04677 -0.824 0.41235
WartyClaim -0.10298 0.12330 -0.835 0.40587
OrdBilling 0.14635 0.10367 1.412 0.16160
DelSpeed 0.16570 0.19644 0.844 0.40124
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5623 on 88 degrees of freedom
Multiple R-squared: 0.8021, Adjusted R-squared: 0.7774
F-statistic: 32.43 on 11 and 88 DF, p-value: < 2.2e-16
ProdQual
-0.44 x
Ecom
+ 0.034 x TechSup
+ 0.16 x
CompRes
-0.02 x Advertising
+ 0.14 x
ProdLine
+ 0.80 x SalesFImage
- 0.038 x
CompPricing
- 0.10 x WartyClaim
+ 0.14 x
OrdBilling
+ 0.16 x DelSpeed
vif(model1)
ProdQual Ecom TechSup CompRes Advertising
1.635797 2.756694 2.976796 4.730448 1.508933
ProdLine SalesFImage ComPricing WartyClaim OrdBilling
3.488185 3.439420 1.635000 3.198337 2.902999
DelSpeed
6.516014
High Variable Inflation Factor (VIF) is a sign of
multicollinearity. There is no formal VIF value for determining the
presence of multicollinearity; however, in weaker models, VIF value
greater than 2.5 may be a cause of concern.
From the VIF values, we can infer that variables
CompRes
and DelSpeed
are a cause of
concern.
Remedial Measures:
Two of the most commonly used methods to deal with multicollinearity in
the model is the following.
<- prodsurvey[,-c(1,13)]
X <- prodsurvey[,13] y
Let’s check the factorability of the variables in the dataset: perform the Kaiser-Meyer-Olkin (KMO) Test.
Kaiser-Meyer-Olkin (KMO) Test and/or Bartlett’s Test is a measure of how suited your data is for Factor Analysis. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic is a measure of the proportion of variance among variables that might be common variance. The lower the proportion, the more suited your data is to Factor Analysis.
<- cor(X)
datamatrix2 KMO(r=datamatrix2)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = datamatrix2)
Overall MSA = 0.65
MSA for each item =
ProdQual Ecom TechSup CompRes Advertising
0.51 0.63 0.52 0.79 0.78
ProdLine SalesFImage ComPricing WartyClaim OrdBilling
0.62 0.62 0.75 0.51 0.76
DelSpeed
0.67
print(cortest.bartlett(datamatrix2, nrow(prodsurvey[,-1])))
$chisq
[1] 619.2726
$p.value
[1] 1.79337e-96
$df
[1] 55
Hence Factor Analysis is considered as an appropriate technique for further analysis of the data.
Plotting the eigenvalues from our factor analysis (whether it’s based on principal axis or principal components extraction), a parallel analysis involves generating random correlation matrices and after factor analyzing them, comparing the resulting eigenvalues to the eigenvalues of the observed data. The idea behind this method is thatobserved eigenvalues that are higher than their corresponding random eigenvalues are more likely to be from “meaningful factors” than observed eigenvalues that are below their corresponding random eigenvalue.
<- fa.parallel(X) parallel
Parallel analysis suggests that the number of factors = 3 and the number of components = 3
# other option of scree plot
scree(X)
Selection of factors from the scree plot can be based on:
When looking at the parallel analysis scree plots, there are two
places to look depending on which type of factor analysis you’re looking
to run. The two blue lines show you the observed eigenvalues - they
should look identical to the scree plots drawn by the scree function.
The red dotted lines show the random eigenvalues or the simulated data
line. Each point on the blue line that lies above the corresponding
simulated data line is a factor or component to extract. In this
analysis, you can see that 4 factors in the “Factor Analysis” parallel
analysis lie above the corresponding simulated data line and 3
components in the “Principal Components” parallel analysis lie above the
corresponding simulated data line.
In our case, however, the last factor/component lies very close to the
line - for both principal components extraction and principal axis
extraction. Thus, we should probably compare the 3 factor and 4 factor
solutions, to see which one is most interpretable.
<- eigen(cor(X))
ev $values ev
[1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378
[7] 0.40151815 0.24695154 0.20355327 0.13284158 0.09842702
<- rep(1:11)
Factor <- ev$values
Eigen_Values <- data.frame(Factor, Eigen_Values)
Scree
ggplot(data = Scree, mapping = aes(x = Factor, y = Eigen_Values)) +
geom_point() +
geom_line() +
scale_y_continuous(name = "Eigen Values", limits = c(0,4)) +
theme_bw() +
theme(panel.grid.major.y = element_line(color = "skyblue")) +
ggtitle("Scree Plot")
So as per the elbow or Kaiser-Guttman normalization rule, we are good to go ahead with 4 factors.
<- fa(r=X,
fa.none nfactors = 4,
# covar = FALSE, SMC = TRUE,
fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
max.iter=100, # (50 is the default, but we have changed it to 100
rotate="none") # none rotation
print(fa.none)
Factor Analysis using method = pa
Call: fa(r = X, nfactors = 4, rotate = "none", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 PA3 PA4 h2 u2 com
ProdQual 0.20 -0.41 -0.06 0.46 0.42 0.576 2.4
Ecom 0.29 0.66 0.27 0.22 0.64 0.362 2.0
TechSup 0.28 -0.38 0.74 -0.17 0.79 0.205 1.9
CompRes 0.86 0.01 -0.26 -0.18 0.84 0.157 1.3
Advertising 0.29 0.46 0.08 0.13 0.31 0.686 1.9
ProdLine 0.69 -0.45 -0.14 0.31 0.80 0.200 2.3
SalesFImage 0.39 0.80 0.35 0.25 0.98 0.021 2.1
ComPricing -0.23 0.55 -0.04 -0.29 0.44 0.557 1.9
WartyClaim 0.38 -0.32 0.74 -0.15 0.81 0.186 2.0
OrdBilling 0.75 0.02 -0.18 -0.18 0.62 0.378 1.2
DelSpeed 0.90 0.10 -0.30 -0.20 0.94 0.058 1.4
PA1 PA2 PA3 PA4
SS loadings 3.21 2.22 1.50 0.68
Proportion Var 0.29 0.20 0.14 0.06
Cumulative Var 0.29 0.49 0.63 0.69
Proportion Explained 0.42 0.29 0.20 0.09
Cumulative Proportion 0.42 0.71 0.91 1.00
Mean item complexity = 1.9
Test of the hypothesis that 4 factors are sufficient.
The degrees of freedom for the null model are 55 and the objective function was 6.55 with Chi Square of 619.27
The degrees of freedom for the model are 17 and the objective function was 0.33
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.03
The harmonic number of observations is 100 with the empirical chi square 3.19 with prob < 1
The total number of observations was 100 with Likelihood Chi Square = 30.27 with prob < 0.024
Tucker Lewis Index of factoring reliability = 0.921
RMSEA index = 0.088 and the 90 % confidence intervals are 0.032 0.139
BIC = -48.01
Fit based upon off diagonal values = 1
Measures of factor score adequacy
PA1 PA2 PA3 PA4
Correlation of (regression) scores with factors 0.98 0.97 0.95 0.88
Multiple R square of scores with factors 0.96 0.95 0.91 0.78
Minimum correlation of possible factor scores 0.92 0.90 0.82 0.56
<- factanal(X, factors=4, scores = c("regression"), rotation = "none")
factanal.none print(factanal.none)
Call:
factanal(x = X, factors = 4, scores = c("regression"), rotation = "none")
Uniquenesses:
ProdQual Ecom TechSup CompRes Advertising
0.682 0.360 0.228 0.178 0.679
ProdLine SalesFImage ComPricing WartyClaim OrdBilling
0.005 0.017 0.636 0.163 0.347
DelSpeed
0.076
Loadings:
Factor1 Factor2 Factor3 Factor4
ProdQual 0.467 -0.148 -0.229 -0.159
Ecom 0.791
TechSup 0.198 -0.482 0.707
CompRes 0.589 0.316 0.535 0.300
Advertising 0.556 0.110
ProdLine 0.997
SalesFImage 0.987
ComPricing -0.491 0.252 0.238
WartyClaim 0.279 0.124 -0.487 0.712
OrdBilling 0.452 0.275 0.493 0.360
DelSpeed 0.629 0.364 0.582 0.241
Factor1 Factor2 Factor3 Factor4
SS loadings 2.522 2.316 1.469 1.322
Proportion Var 0.229 0.211 0.134 0.120
Cumulative Var 0.229 0.440 0.573 0.694
Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 24.26 on 17 degrees of freedom.
The p-value is 0.113
<- factanal(X, factors=4, scores = c("regression"), rotation = "varimax")
factanal.var print(factanal.var)
Call:
factanal(x = X, factors = 4, scores = c("regression"), rotation = "varimax")
Uniquenesses:
ProdQual Ecom TechSup CompRes Advertising
0.682 0.360 0.228 0.178 0.679
ProdLine SalesFImage ComPricing WartyClaim OrdBilling
0.005 0.017 0.636 0.163 0.347
DelSpeed
0.076
Loadings:
Factor1 Factor2 Factor3 Factor4
ProdQual 0.557
Ecom 0.793
TechSup 0.872 0.102
CompRes 0.884 0.142 0.135
Advertising 0.190 0.521 -0.110
ProdLine 0.502 0.104 0.856
SalesFImage 0.119 0.974 -0.130
ComPricing 0.225 -0.216 -0.514
WartyClaim 0.894 0.158
OrdBilling 0.794 0.101 0.105
DelSpeed 0.928 0.189 0.164
Factor1 Factor2 Factor3 Factor4
SS loadings 2.592 1.977 1.638 1.423
Proportion Var 0.236 0.180 0.149 0.129
Cumulative Var 0.236 0.415 0.564 0.694
Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 24.26 on 17 degrees of freedom.
The p-value is 0.113
Factor analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented an “interpretable” solution as a table of loadings or graphically, where all loadings with an absolute value \(\rightarrow\) some cut point are represented as an edge (path).
fa.diagram(fa.none)
<- fa(r=X,
fa.var nfactors = 4,
# covar = FALSE, SMC = TRUE,
fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
max.iter=100, # (50 is the default, but we have changed it to 100
rotate="varimax") # none rotation
print(fa.var)
Factor Analysis using method = pa
Call: fa(r = X, nfactors = 4, rotate = "varimax", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 PA3 PA4 h2 u2 com
ProdQual 0.02 -0.07 0.02 0.65 0.42 0.576 1.0
Ecom 0.07 0.79 0.03 -0.11 0.64 0.362 1.1
TechSup 0.02 -0.03 0.88 0.12 0.79 0.205 1.0
CompRes 0.90 0.13 0.05 0.13 0.84 0.157 1.1
Advertising 0.17 0.53 -0.04 -0.06 0.31 0.686 1.2
ProdLine 0.53 -0.04 0.13 0.71 0.80 0.200 1.9
SalesFImage 0.12 0.97 0.06 -0.13 0.98 0.021 1.1
ComPricing -0.08 0.21 -0.21 -0.59 0.44 0.557 1.6
WartyClaim 0.10 0.06 0.89 0.13 0.81 0.186 1.1
OrdBilling 0.77 0.13 0.09 0.09 0.62 0.378 1.1
DelSpeed 0.95 0.19 0.00 0.09 0.94 0.058 1.1
PA1 PA2 PA3 PA4
SS loadings 2.63 1.97 1.64 1.37
Proportion Var 0.24 0.18 0.15 0.12
Cumulative Var 0.24 0.42 0.57 0.69
Proportion Explained 0.35 0.26 0.22 0.18
Cumulative Proportion 0.35 0.60 0.82 1.00
Mean item complexity = 1.2
Test of the hypothesis that 4 factors are sufficient.
The degrees of freedom for the null model are 55 and the objective function was 6.55 with Chi Square of 619.27
The degrees of freedom for the model are 17 and the objective function was 0.33
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.03
The harmonic number of observations is 100 with the empirical chi square 3.19 with prob < 1
The total number of observations was 100 with Likelihood Chi Square = 30.27 with prob < 0.024
Tucker Lewis Index of factoring reliability = 0.921
RMSEA index = 0.088 and the 90 % confidence intervals are 0.032 0.139
BIC = -48.01
Fit based upon off diagonal values = 1
Measures of factor score adequacy
PA1 PA2 PA3 PA4
Correlation of (regression) scores with factors 0.98 0.99 0.94 0.88
Multiple R square of scores with factors 0.96 0.97 0.88 0.78
Minimum correlation of possible factor scores 0.93 0.94 0.77 0.55
$loadings fa.var
Loadings:
PA1 PA2 PA3 PA4
ProdQual 0.647
Ecom 0.787 -0.113
TechSup 0.883 0.116
CompRes 0.898 0.130 0.132
Advertising 0.166 0.530
ProdLine 0.525 0.127 0.712
SalesFImage 0.115 0.971 -0.135
ComPricing 0.213 -0.209 -0.590
WartyClaim 0.103 0.885 0.128
OrdBilling 0.768 0.127
DelSpeed 0.949 0.185
PA1 PA2 PA3 PA4
SS loadings 2.635 1.967 1.641 1.371
Proportion Var 0.240 0.179 0.149 0.125
Cumulative Var 0.240 0.418 0.568 0.692
fa.diagram(fa.var)
Factors | Variables | Label | Short Interpretaation |
---|---|---|---|
PA1 | DelSpeed, CompRes, OrdBilling | Purchase | All the items are related to purchasing the product; from placing order to billing and getting it delivered. |
PA2 | SalesFImage, Ecom, Advertising | Marketing | In this factors the item are related to marketing processes like the image of sales force and spending on adverstising |
PA3 | WartyClaim, TechSup | Post Purchase | Post purchase activities are included in this factor; like warranty claims and technical support. |
PA4 | ProdLine, ProdQual, CompPricing | Product Position | Product positioning related items are grouped in this factor. |
head(fa.var$scores)
PA1 PA2 PA3 PA4
[1,] -0.1338871 0.9175166 -1.719604873 0.09135411
[2,] 1.6297604 -2.0090053 -0.596361722 0.65808192
[3,] 0.3637658 0.8361736 0.002979966 1.37548765
[4,] -1.2225230 -0.5491336 1.245473305 -0.64421384
[5,] -0.4854209 -0.4276223 -0.026980304 0.47360747
[6,] -0.5950924 -1.3035333 -1.183019401 -0.95913571
#fa.var$scores # for 100 observations
<- cbind(prodsurvey["Satisfaction"], fa.var$scores)
regdata
#Labeling the data
names(regdata) <- c("Satisfaction", "Purchase", "Marketing",
"Post_purchase", "Prod_positioning")
head(regdata)
Satisfaction Purchase Marketing Post_purchase Prod_positioning
1 8.2 -0.1338871 0.9175166 -1.719604873 0.09135411
2 5.7 1.6297604 -2.0090053 -0.596361722 0.65808192
3 8.9 0.3637658 0.8361736 0.002979966 1.37548765
4 4.8 -1.2225230 -0.5491336 1.245473305 -0.64421384
5 7.1 -0.4854209 -0.4276223 -0.026980304 0.47360747
6 4.7 -0.5950924 -1.3035333 -1.183019401 -0.95913571
#Splitting the data 70:30
#Random number generator, set seed.
set.seed(100)
= sample(1:nrow(regdata), 0.7*nrow(regdata))
indices=regdata[indices,]
train= regdata[-indices,] test
= lm(Satisfaction~., train)
model.fa.score summary(model.fa.score)
Call:
lm(formula = Satisfaction ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-1.76280 -0.48717 0.06799 0.46459 1.24022
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91852 0.08068 85.757 < 2e-16 ***
Purchase 0.50230 0.07641 6.574 9.74e-09 ***
Marketing 0.75488 0.08390 8.998 5.00e-13 ***
Post_purchase 0.08755 0.08216 1.066 0.291
Prod_positioning 0.58074 0.08781 6.614 8.30e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6629 on 65 degrees of freedom
Multiple R-squared: 0.7261, Adjusted R-squared: 0.7093
F-statistic: 43.08 on 4 and 65 DF, p-value: < 2.2e-16
vif(model.fa.score)
Purchase Marketing Post_purchase Prod_positioning
1.009648 1.008235 1.015126 1.024050
#Model Performance metrics:
<- predict(model.fa.score, newdata = test, type = "response")
pred_test pred_test
6 8 11 13 17 19 26
4.975008 5.908267 6.951629 8.677431 6.613838 6.963113 6.313513
33 34 35 37 40 42 44
6.141338 6.158993 7.415742 6.589746 6.858206 7.133989 8.533080
49 50 53 56 57 60 65
8.765145 8.078744 7.395438 7.468360 8.744402 6.276660 5.936570
67 71 73 75 80 96 97
6.650322 8.299545 7.685564 7.330191 6.719528 7.540233 6.143172
99 100
8.084583 5.799897
#Find MSE and MAPE scores:
#MSE/ MAPE of Model1
$Satisfaction_Predicted <- pred_test
testhead(test[c("Satisfaction","Satisfaction_Predicted")], 10)
Satisfaction Satisfaction_Predicted
6 4.7 4.975008
8 6.3 5.908267
11 7.4 6.951629
13 8.4 8.677431
17 6.4 6.613838
19 6.8 6.963113
26 6.6 6.313513
33 5.4 6.141338
34 7.3 6.158993
35 6.3 7.415742
<- cor(test$Satisfaction, test$Satisfaction_Predicted) ^2
test_r2
<- mse(test$Satisfaction, pred_test)
mse_test <- sqrt(mse(test$Satisfaction, pred_test))
rmse_test <- mape(test$Satisfaction, pred_test)
mape_test <- cbind(mse_test,rmse_test,mape_test,test_r2)
model_metrics print(model_metrics, 3)
mse_test rmse_test mape_test test_r2
[1,] 0.551 0.742 0.0875 0.554
As the feature Post_purchase
is not significant so we
will drop this feature and then let’s run the regression model
again.
##Regression model without post_purchase:
<- lm(Satisfaction ~ Purchase+ Marketing+
model2 data= train)
Prod_positioning, summary(model2)
Call:
lm(formula = Satisfaction ~ Purchase + Marketing + Prod_positioning,
data = train)
Residuals:
Min 1Q Median 3Q Max
-1.7268 -0.5036 0.1117 0.4408 1.2962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.92158 0.08071 85.761 < 2e-16 ***
Purchase 0.49779 0.07637 6.518 1.15e-08 ***
Marketing 0.75870 0.08391 9.042 3.66e-13 ***
Prod_positioning 0.59084 0.08739 6.761 4.30e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6636 on 66 degrees of freedom
Multiple R-squared: 0.7213, Adjusted R-squared: 0.7087
F-statistic: 56.95 on 3 and 66 DF, p-value: < 2.2e-16
<- predict(model2, newdata = test, type = "response")
pred_test2 pred_test2
6 8 11 13 17 19 26
5.069667 5.958765 7.001694 8.695630 6.521440 6.994018 6.249667
33 34 35 37 40 42 44
6.123357 6.132165 7.415882 6.566556 6.964570 7.233376 8.424720
49 50 53 56 57 60 65
8.853060 8.177009 7.413693 7.404078 8.746511 6.215816 5.893515
67 71 73 75 80 96 97
6.476465 8.227837 7.839941 7.354907 6.734279 7.559995 6.293774
99 100
8.083909 5.832094
$Satisfaction_Predicted2 <- pred_test2
testhead(test[c(1,7)], 10)
Satisfaction Satisfaction_Predicted2
6 4.7 5.069667
8 6.3 5.958765
11 7.4 7.001694
13 8.4 8.695630
17 6.4 6.521440
19 6.8 6.994018
26 6.6 6.249667
33 5.4 6.123357
34 7.3 6.132165
35 6.3 7.415882
<- cor(test$Satisfaction, test$Satisfaction_Predicted2) ^2
test_r22 <- mse(test$Satisfaction, pred_test2)
mse_test2 <- sqrt(mse(test$Satisfaction, pred_test2))
rmse_test2 <- mape(test$Satisfaction, pred_test2)
mape_test2
<- cbind(mse_test2,rmse_test2,mape_test2,test_r22)
model2_metrics model2_metrics
mse_test2 rmse_test2 mape_test2 test_r22
[1,] 0.5446039 0.7379728 0.08490197 0.5595099
<- rbind(model_metrics,model2_metrics)
Overall row.names(Overall) <- c("Test1", "Test2")
colnames(Overall) <- c("MSE", "RMSE", "MAPE", "R-squared")
print(Overall,3)
MSE RMSE MAPE R-squared
Test1 0.551 0.742 0.0875 0.554
Test2 0.545 0.738 0.0849 0.560
Including Interaction model, we are able to make a better prediction!??
We saw how Factor Analysis can be used to reduce the dimensionality of a dataset and then we used multiple linear regression on the dimensionally reduced columns/Features for further analysis/predictions.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nguyen (2022, Sept. 30). HaiBiostat: Factor Analysis applied in Multivariate Regression: Machine Learning's Perspective. Retrieved from https://hai-mn.github.io/posts/2022-09-30-FA in Multivariate Regression/
BibTeX citation
@misc{nguyen2022factor, author = {Nguyen, Hai}, title = {HaiBiostat: Factor Analysis applied in Multivariate Regression: Machine Learning's Perspective}, url = {https://hai-mn.github.io/posts/2022-09-30-FA in Multivariate Regression/}, year = {2022} }